SpatialDeconTest

The goal of SpatialDeconTest is to provide a tool that enables the deconvolution of cell types and cell type proportions present within each spot obtained from 10X’s visium - spatial trancsiptomics- technology.

Graphical abstract

Graphical abstract

Installation

You can install the latest version from the GitHub repository SpatialDeconTest with:

Tutorial

This is a basic example on a simple dataset to illustrate how to run the workflow:

Single Cell Mixology

In this step by step analysis we will assess how the deconvolution performs on the single cell mixology data generate by Matthew E. Ritchie in his paper Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. All data can be accessed in this sc_mixology github repository. It uses 5 different cell types for the scRNAseq: human lung adenocarcinoma cell lines HCC827, H1975, H2228, H838, and A549. To do the mixed spots it only uses the first 3 (HCC827, H1975, H2228) to do 9 cell combinations.

We are going to use this data since it is put out to carry out benchmarking experiments and is a good controled way of knowing wich combination of 9-cells is in each mixture.

Downsampling + Data preprocessing

If the dataset is very large we want to downsample it, both in terms of number of cells and number of genes, to train the model. To do this downsampling we want to keep a representative amount of cells per cluster and the most important genes. We will select first the genes of interest, to do so we will grab each cluster’s markers plus the 5000 most variable genes.
We can extract the top marker genes from each cluster and select the unique ones to use as seeds for the model

Train LDA model

Once we have the data ready to pass to the model we can train it as shown below. For larger and more complex datasets ~8000 gibbs iterations are recommended. If more iterations are needed you can always resume where you left off and run more iterations on the previosuly trained model.

Spot Deconvolution

Create all posible combinations between 3-8 cells per spot. We get the cell composition and the topic profile por each combination. We will compare the topic profiles of all these synthetic spots with the predicted topic profiles obtained from running the spatial spots through the LDA model. The prediction process can be parallelized to speed it up.

Mixed spots representing the grount truth are obtained from the sce_9cells_qc object where all possible combinations of HCC827, H1975, H2228 were carried out.

Assess the performance comparing with the ground truth. We need to remove the 90 cell spots:

BONUS - Benchmark with synthetic test spots

Furthermore, we can test the performance of the model on synthetically generated test spots to get a sense on how well the model will perform on your data.

First of all we can check if the model has achieved an optimal solution by checking if the maximum-likelyhood has plateaud.

If the model hasn’t plateaud or has plateaud below the best model with the set number of iterations we can run more iterations on the model right where we left off. We can change the nstart parameter which will run the model at 3 different stochastic start points to try to find a better local minima.

If the model had already reached a local minima or we have reached it by re-training the model we can assess its performance with synthetically generated test spots. We can generate test spots with the function test_spot_fun() below.

We then predict the topic profiles of these synthetic generated spots

Lastly, we can assess its performance with the function test_synthetic_performance()

Step-by-step workflow

If you want to understand a bit better what is going on in the previous step here is the workflow executed step-by-step. All the above steps are to show how the workflow works and make it available and understandable.

Normalizing and Transform the data, dimensionality reduction and clusters assignment.

1st standard viz

QC plots

From the QC plots we can see that library size, number of detected genes and mitochondrial gene proportion follow a normal distribution with not much skewness. Furthermore the mitochondrial proportion does not show a correlatino with the number of detected genes so we wont regress out the effect ot mitochondrial proportion. No obvious low quality cells remain in this dataset so we won’t exclude any further ones.

Looking at the clustering we can clearly see that the cells cluster by cell line.

UMAP demuxlet

If we take a look at cell cycle genes to see if we need to correct by it we see that the cells are pretty synchronized. Since it is always better to not over-correct the data we won’t regress out the cell cycle.

Feature plot Cell cycle

Downsampling dataset

If the dataset is very large we want to downsample it, both in terms of number of cells and number of genes, to train the model. To do this downsampling we want to keep a representative amount of cells per cluster and the most important genes. We will select first the genes of interest, to do so we will grab each cluster’s markers plus the 5000 most variable genes.
We can extract the top marker genes from each cluster and select the unique ones to use as seeds for the model

Select only the 3 cell types used in the mixed spots

Combine marker genes and highest variable genes and subset genes

Get cell IDs to subset by cluster

Subset seurat object

Train model

We will set some parameters the we will use to train the model. K is the numer of topics which we will assume to be the same as the number of clusters found by Seurat. With droplevels we make sure that there are no levels defined with no representation

We then will get the dataset ready to pass to the LDA model

Select up to top 100 marker genes for each cluster

Select unique markers from each cluster, if there are common markers between clusters lda model gets confused and classifies very different clusters as belonging to the same topic just because the seeding induced it!

Set seedwords from top markers. Here we are setting the weights for each topic, the words that are weighted positively are those belonging to the list of top markers for a cluster. In the seedgenes matrix each row represents a topic and each column represents a gene. To the LDA model we need to pass a matrix with k rows and ngene columns, where each cell has the weight of that gene for that topic. The weight we’re assigning is the logFC

Predict topic profiles

Mixed spots representing the grount truth are obtained from the sce_9cells_qc object where all possible combinations of HCC827, H1975, H2228 were carried out.

Now to predict the topic profiles of the mixed spots…

Generate all posible synthetic spots

Below we generate all the possible spot combinations containing between 2-8 cells since we aproximate that by their size the spots won’t have more than ~10 cells

#### Calculate topic profiles for every cluster ####
clust_profiles <- topic_profile_per_cluster(lda_mod = lda_mod, se_obj = se_sc_10x_5cl_qc, clust_vr = clust_vr)
round(clust_profiles,4)

if(class(clust_profiles) != "matrix") clust_profiles <- as.matrix(clust_profiles)

# If a cluster is 0 change it to 1
if(sum(grepl(pattern = "0",rownames(clust_profiles))) != 0){
  rownames(clust_profiles) <- as.character(as.numeric(rownames(clust_profiles))+1)
}

# Compute all possible combinations up to grabbing round(nrow(comb)*0.66)
k_sub <- 8
comb <- combinations(x = c(0:(nrow(clust_profiles))), k = k_sub, replace=TRUE)

# Remove all those combinations that only include 1 or 2 cells
comb <- comb[rowSums(comb != 0) > 2,]


# Create all possible combinations
## Initialize matrix for increased speed so that it doesn't need to create indexes on the fly
tmp_mtrx <- matrix(nrow = nrow(comb), ncol = ncol(clust_profiles))
tmp_metadata <- matrix(nrow = nrow(comb), ncol = nrow(clust_profiles))

print('Creating synthetic spots'); st_syn_spot <- Sys.time()
pb_for <- txtProgressBar(min = 0, max = nrow(comb), style = 3) # Progress bar

for (i in 1:nrow(comb)) {
  # Get how many cells of each type we have
  tt <- table(comb[i,][comb[i,]!=0])
  tmp_metadata[i,as.numeric(names(tt))] <- tt

  # Add all the profiles together
  row_i <- lapply(names(tt), function(nm){
    tmp_vec <- tt[[nm]]*clust_profiles[rownames(clust_profiles)[[as.numeric(nm)]],]
  }) %>% purrr::reduce(.,`+`)

  tmp_mtrx[i,] <- row_i/sum(tt)
  # update progress bar
  setTxtProgressBar(pb_for, i)
}
# rm(list(i,tt,row_i)) # For clean and good practice code, that there are no random tmp variables floating
close(pb_for)
print(sprintf('Creation of %s synthetic spot profiles took: %s minutes',
                          nrow(comb),
                          round(difftime(time1 = Sys.time(),time2 = st_syn_spot,units = 'mins'),2)))

syn_spots_ls <- list(tmp_mtrx,tmp_metadata)

Spot Deconvolution

Create all posible combinations between 3-8 cells per spot. We get the cell composition and the topic profile por each combination. We will compare the topic profiles of all these synthetic spots with the predicted topic profiles obtained from running the spatial spots through the LDA model. The prediction process can be parallelized to speed it up.

syn_spots_profiles <- syn_spots_ls[[1]]
syn_spots_metadata <- as.matrix(syn_spots_ls[[2]])
syn_spots_metadata[is.na(syn_spots_metadata)] <- 0

# Calculate all pairwise euclidean distances between the predicted and simulated topic profiles
dist <- pdist(X=prediction,Y=syn_spots_profiles)
dist_mtrx <- as.matrix(dist)

JSD_start <- Sys.time()

# Get list with indices of best euclidean distance for each predictions
JSD_indices <- top_n_predictions(dist_mtrx = dist_mtrx, n = top_dist)

#### Calculate JSD for the subset of best predictions according to Euclidean distance
mtrx_JSD_full <- suppressMessages(calculate_JSD_subset(prediction = prediction, syn_spots_profiles = syn_spots_profiles, JSD_indices = JSD_indices))

quants_JSD <- round(quantile(matrixStats::rowMins(mtrx_JSD_full,na.rm = TRUE),c(0.25,0.5,0.75)),5)
cat(sprintf("Quantiles of the JSD between the best synthetic spot profile and each spot's topic profile are - %s[%s-%s]", quants_JSD[[2]], quants_JSD[[1]], quants_JSD[[3]]))

# Get the index for each list from JSD_indices with the lowest JSD
top_JSD <- if_else(top_JSD <= top_dist, top_JSD, top_dist)

min15_error <- Rfast::rownth(x = mtrx_JSD_full, elems = rep(top_JSD, nrow(mtrx_JSD_full)), na.rm = TRUE)
min_indices_JSD <- lapply(1:length(min15_error), function(i){
  which(mtrx_JSD_full[i,] <= min15_error[i])
})

##### Get Spot composition #####
spot_composition_mtrx <- matrix(nrow = length(min_indices_JSD), ncol = ncol(syn_spots_metadata))
for (i in 1:nrow(spot_composition_mtrx)) {
  spot_composition_mtrx[i,] <- round(colMeans(syn_spots_metadata[JSD_indices[[i]][min_indices_JSD[[i]]],],na.rm = TRUE),0)
}